418 research outputs found

    Two-Phase Biomedical Named Entity Recognition Using A Hybrid Method


    Writing clinical practice guidelines in controlled natural language

    Clinicians could benefit from decision support systems incorporating the knowledge contained in clinical practice guidelines. However, the unstructured form of these guidelines makes them unsuitable for formal representation. To address this challenge, we translated a complete set of pediatric guideline recommendations into Attempto Controlled English (ACE). An experienced pediatrician, a physician, and a knowledge engineer judged that a suitably extended version of ACE can accurately and naturally represent the clinical concepts and the proposed actions of the guidelines. We are currently developing a systematic and replicable approach to authoring guideline recommendations in ACE.

    Automatic extraction of candidate nomenclature terms using the doublet method

    BACKGROUND: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature. 
    RESULTS: A 31+ megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing an abstract title. The total number of words included in the abstract titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. A final automatic extraction of duplicate terms yielded a final list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature. CONCLUSION: The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
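
The doublet matching described in this abstract can be illustrated with a minimal Python sketch. This is not the article's Perl implementation: the tokenizer, the function names (`tokenize`, `doublets`, `extract_candidates`), and the toy nomenclature are all assumptions made for illustration, and the published method may handle details such as punctuation or stop words differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def doublets(tokens):
    """Return the set of overlapping word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))


def extract_candidates(text, nomenclature):
    """Find runs of known doublets in text that are not already whole terms."""
    known_terms = {" ".join(tokenize(term)) for term in nomenclature}
    known_doublets = set()
    for term in nomenclature:
        known_doublets |= doublets(tokenize(term))

    tokens = tokenize(text)
    candidates = []
    run = []  # words of the current contiguous sequence of matching doublets

    # A trailing sentinel pair guarantees the final run is flushed.
    pairs = list(zip(tokens, tokens[1:])) + [(None, None)]
    for first, second in pairs:
        if (first, second) in known_doublets:
            if run:
                run.append(second)  # extend the run by one overlapping doublet
            else:
                run = [first, second]
        else:
            if run:
                phrase = " ".join(run)
                # Keep only sequences that are new to the nomenclature.
                if phrase not in known_terms and phrase not in candidates:
                    candidates.append(phrase)
            run = []
    return candidates


# Hypothetical toy nomenclature, for illustration only.
toy_nomenclature = [
    "renal cell carcinoma",
    "squamous cell carcinoma",
    "carcinoma of lung",
]
title = "A case of squamous cell carcinoma of lung with metastasis"
print(extract_candidates(title, toy_nomenclature))
```

In this toy run, every doublet in "squamous cell carcinoma of lung" occurs somewhere in the nomenclature, but the whole phrase does not, so it is reported as a candidate for curator review.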

    Looking at Cerebellar Malformations through Text-Mined Interactomes of Mice and Humans

    We have generated and made publicly available two very large networks of molecular interactions: 49,493 mouse-specific and 52,518 human-specific interactions. These networks were generated through automated analysis of 368,331 full-text research articles and 8,039,972 article abstracts from the PubMed database, using the GeneWays system. Our networks cover a wide spectrum of molecular interactions, such as bind, phosphorylate, glycosylate, and activate; 207 of these interaction types occur more than 1,000 times in our unfiltered, multi-species data set. Because mouse and human genes are linked through an orthological relationship, human and mouse networks are amenable to straightforward, joint computational analysis. Using our newly generated networks and known associations between mouse genes and cerebellar malformation phenotypes, we predicted a number of new associations between genes and five cerebellar phenotypes (small cerebellum, absent cerebellum, cerebellar degeneration, abnormal foliation, and abnormal vermis). Using a battery of statistical tests, we showed that genes that are associated with cerebellar phenotypes tend to form compact network clusters. Further, we observed that cerebellar malformation phenotypes tend to be associated with highly connected genes. This tendency was stronger for developmental phenotypes and weaker for cerebellar degeneration.

    A scalable machine-learning approach to recognize chemical names within large text databases

    MOTIVATION: The use or study of chemical compounds permeates almost every scientific field, and in each of them the amount of textual information is growing rapidly. There is a need to accurately identify chemical names within text for a number of informatics efforts such as database curation, report summarization, tagging of named entities and keywords, or the development/curation of reference databases. RESULTS: A first-order Markov Model (MM) was evaluated for its ability to distinguish chemical names from words, yielding ~93% recall in recognizing chemical terms and ~99% precision in rejecting non-chemical terms on smaller test sets. However, because total false-positive events increase with the number of words analyzed, the scalability of name recognition was measured by processing 13.1 million MEDLINE records. Precision ranged from 54.7% to 100%, depending upon the cutoff score used, averaging 82.7% for approximately 1.05 million putative chemical terms extracted. Extracted chemical terms were analyzed to estimate the number of spelling variants per term, which correlated with the total number of times the chemical name appeared in MEDLINE. This variability in term construction was found to affect both information retrieval and term mapping when using PubMed and Ovid.
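
As a rough illustration of how a first-order Markov model with a cutoff score might separate chemical names from ordinary words, the Python sketch below trains two character-level models and scores a word by their log-likelihood ratio. Everything here is an assumption for illustration: the abstract does not state that the model is character-level, and the class name `CharMarkov`, the add-one smoothing, the alphabet size, and the tiny training lists are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


class CharMarkov:
    """First-order character Markov model with add-one smoothing (assumed design)."""

    def __init__(self, words, alphabet_size=40):
        self.pair_counts = Counter()   # counts of (previous char, next char)
        self.prev_counts = Counter()   # counts of previous char
        self.alphabet_size = alphabet_size
        for word in words:
            padded = "^" + word.lower() + "$"  # start and end markers
            for prev, nxt in zip(padded, padded[1:]):
                self.pair_counts[(prev, nxt)] += 1
                self.prev_counts[prev] += 1

    def log_prob(self, word):
        """Sum of smoothed log transition probabilities over the word."""
        padded = "^" + word.lower() + "$"
        total = 0.0
        for prev, nxt in zip(padded, padded[1:]):
            numerator = self.pair_counts[(prev, nxt)] + 1
            denominator = self.prev_counts[prev] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


# Hypothetical toy training data; a real system would need far larger lists.
chemical_model = CharMarkov(
    ["methanol", "ethanol", "propanol", "benzene", "toluene", "acetone", "butanone"]
)
word_model = CharMarkov(
    ["table", "house", "window", "garden", "people", "water", "letter"]
)


def score(word):
    """Log-likelihood ratio: positive values favor the chemical model."""
    return chemical_model.log_prob(word) - word_model.log_prob(word)


def is_chemical(word, cutoff=0.0):
    """Classify a word as chemical when its score exceeds the cutoff."""
    return score(word) > cutoff


for candidate in ["butanol", "window"]:
    print(candidate, round(score(candidate), 2), is_chemical(candidate))
```

Raising `cutoff` accepts fewer words as chemical names, which is one way to picture the trade-off between precision and the cutoff score that the abstract reports.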

    Germline MC1R status influences somatic mutation burden in melanoma

    The major genetic determinants of cutaneous melanoma risk in the general population are disruptive variants (R alleles) in the melanocortin 1 receptor (MC1R) gene. These alleles are also linked to red hair, freckling, and sun sensitivity, all of which are known melanoma phenotypic risk factors. Here we report that in melanomas and for somatic C>T mutations, a signature linked to sun exposure, the expected single-nucleotide variant count associated with the presence of an R allele is estimated to be 42% (95% CI, 15-76%) higher than that among persons without an R allele. This figure is comparable to the expected mutational burden associated with an additional 21 years of age. We also find significant and similar enrichment of non-C>T mutation classes, supporting a role for additional mutagenic processes in melanoma development in individuals carrying R alleles.